Enclosure 1


1  Problems  studied 

The main focus of the research is on the accuracy, interpretability, and
visualizability of classification and regression trees. This is partly motivated
by the recent interest, within the statistics and data mining communities, in
averaging across ensembles of trees to increase prediction accuracy (Breiman
1996, Breiman 1998). A serious disadvantage of ensemble techniques is the
impracticality of simultaneously interpreting more than a very small number
of trees. This is ironic, as research in tree-structured methods was originally
motivated by the desire for an interpretable alternative to standard methods
such as multiple linear regression and neural networks.

Another problem with most tree construction algorithms is that their
variable selection methods are biased towards choosing some types of
variables over others. As a result, conclusions drawn from such trees can be,
and often are, wrong.
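
The selection bias described above can be illustrated with a small simulation.
The sketch below is not taken from any of the algorithms in this report; it
assumes a generic exhaustive-search rule that picks the variable whose best
split gives the largest reduction in the sum of squared errors. The response
is generated independently of both predictors, so an unbiased rule would pick
each predictor about half the time. The function names and the simulation
settings are hypothetical.

```python
import random


def best_sse_reduction(x, y):
    """Largest reduction in sum of squared errors over all splits x <= c."""
    n = len(y)
    pairs = sorted(zip(x, y))
    total = sum(y)
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no split point exists between tied x values
        right_sum = total - left_sum
        # SSE(parent) - SSE(left) - SSE(right), written with sums only
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total ** 2 / n
        best = max(best, gain)
    return best


def selection_frequency(n_trials=500, n=50, seed=0):
    """Fraction of trials in which the continuous predictor is selected.

    The response is pure noise, so neither predictor is informative.
    """
    rng = random.Random(seed)
    wins_continuous = 0
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n)]
        x_binary = [rng.randint(0, 1) for _ in range(n)]  # one candidate split
        x_cont = [rng.random() for _ in range(n)]         # up to n - 1 splits
        if best_sse_reduction(x_cont, y) > best_sse_reduction(x_binary, y):
            wins_continuous += 1
    return wins_continuous / n_trials


if __name__ == "__main__":
    print(selection_frequency())
```

Under these assumptions the continuous predictor is typically selected far
more often than half the time, simply because it offers more candidate splits.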

A primary goal of our project is to design algorithms for trees of sufficient
complexity (not in terms of size but in the type of splits and node models)
that the accuracy of a single tree is comparable to that of an ensemble of trees.
In addition, the trees should be free of variable selection bias. Another
important practical goal is to implement the algorithms in high-quality computer
software for Windows, Linux, Macintosh and other operating systems.

2  Summary  of  important  results 

2.1  Linear  regression  trees 

The most progress was made in this area, because the PI is solely
responsible for the design and implementation of the GUIDE algorithm. At the
time this report is written, GUIDE has already far outpaced all other
regression tree software, including well-known ones such as CART (Breiman,
Friedman, Olshen and Stone 1984) and M5 (Quinlan 1992). With the exception
of the lesser-known RT (Torgo 1999), the other methods are exclusively
designed for piecewise-constant least squares regression. GUIDE can produce
piecewise-polynomial and piecewise-multiple linear (including stepwise
regression) models as well. Further, GUIDE is the only algorithm that can fit
quantile and Poisson regression models. But the most important and unique
feature of GUIDE is that its variable selection procedure for splitting each
node is bias-corrected. This is a very tricky problem that no other regression
tree algorithm has been able to solve. Without bias correction, a tree
model can be worse than useless for interpretation because of the potential
for incorrect inferences.
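
The difference between piecewise-constant and piecewise-linear node models
can be seen in a minimal sketch. It is not the GUIDE algorithm: the split
point is supplied by hand rather than selected, there is a single predictor,
and all function names are hypothetical. It only compares the residual sum of
squares when each of the two leaves is fitted with a constant versus a simple
linear regression.

```python
def fit_constant(xs, ys):
    """Piecewise-constant node model: predict the leaf mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Piecewise-linear node model: simple least squares line in the leaf."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def sse_of_split(xs, ys, cut, fit):
    """Residual sum of squares after splitting at x <= cut and fitting each leaf."""
    total = 0.0
    for keep_left in (True, False):
        part = [(x, y) for x, y in zip(xs, ys) if (x <= cut) == keep_left]
        if not part:
            continue
        leaf_x, leaf_y = zip(*part)
        model = fit(leaf_x, leaf_y)
        total += sum((y - model(x)) ** 2 for x, y in part)
    return total


if __name__ == "__main__":
    # Noise-free data that follow a different line on each side of x = 4
    xs = list(range(10))
    ys = [2.0 * x if x <= 4 else 20.0 - x for x in xs]
    print("constant leaves:", sse_of_split(xs, ys, 4, fit_constant))
    print("linear leaves:  ", sse_of_split(xs, ys, 4, fit_line))
```

With the same single split, the linear node models reproduce this example
exactly, whereas constant node models would need many more splits to
approximate it.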

The GUIDE bias correction approach for least squares and Poisson regression
is explained in article [4]; it is extended to quantile regression in [6]
(article numbers refer to those in Section 3.1). Two other manuscripts have
been submitted for publication. One demonstrates how GUIDE can be used
to visualize high-dimensional datasets. It also contains the results of a
large-scale empirical study showing GUIDE to have prediction accuracy as good
as that of the best tree or non-tree regression algorithms, the latter
including spline models such as GAM (Hastie and Tibshirani 1990) and MARS
(Friedman 1991), and rule-based models from the computer science literature.
The other manuscript illustrates the advantages of GUIDE in fitting models to
classical factorial experiments where the sample sizes are small. The titles
of the manuscripts are listed in Section 3.2.

2.2  Logistic  regression  trees 

The main appeal of a logistic regression tree over a classification tree is its
ability to classify as well as to attach a probability to each subject. The
latter allows the ranking of subjects, which is important to the service
industry, for example. Thus, using a logistic regression tree, a company can
determine the top 20% of its customers most likely to be dissatisfied with
its service or product, and try to improve its product design and services to
achieve higher customer satisfaction.
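
The ranking use described above can be sketched as follows. This is an
illustration only, not the LOTUS algorithm: the one-split tree, its
coefficients, and the field names tenure_years and complaints are all
hypothetical. Each leaf carries its own logistic model, and customers are
then ranked by predicted probability of dissatisfaction.

```python
import math


def tree_probability(record):
    """Predicted probability from a hypothetical one-split logistic regression tree."""
    # Each leaf has its own (made-up) intercept and slope
    if record["tenure_years"] <= 2.0:
        intercept, slope = -0.5, 0.8
    else:
        intercept, slope = -2.0, 0.4
    z = intercept + slope * record["complaints"]
    return 1.0 / (1.0 + math.exp(-z))


def top_fraction(probabilities, fraction=0.2):
    """Return the ids with the highest probabilities, highest first."""
    # Small tolerance guards against floating-point error in fraction * n
    k = math.ceil(fraction * len(probabilities) - 1e-9)
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    customers = {
        "c1": {"tenure_years": 1.0, "complaints": 3},
        "c2": {"tenure_years": 5.0, "complaints": 0},
        "c3": {"tenure_years": 0.5, "complaints": 1},
        "c4": {"tenure_years": 8.0, "complaints": 4},
        "c5": {"tenure_years": 3.0, "complaints": 1},
    }
    probs = {cid: tree_probability(rec) for cid, rec in customers.items()}
    print(top_fraction(probs, 0.2))
```

A classification tree would return only a predicted class for each customer,
so it could not order the customers within a class in this way.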

Three of the PI’s students have written PhD theses on this problem (Lo
1993, Potter 1998, Chan 2000). The latest algorithm is called LOTUS. Like
GUIDE, its distinguishing feature is that it has negligible variable selection
bias. Further, it is relatively inexpensive computationally and can handle both
numerical and categorical covariates. Finally, it is flexible in the type of
linear model used and has a built-in mechanism to handle missing values.
Evaluations based on real and simulated data show that LOTUS performs
well in most situations. The results are reported in articles [9] and [11] in
Section 3.1.

2.3 Classification trees

The QUEST algorithm (Loh and Shih 1997) was stable during the period of
this research. The software received a small number of bug fixes. QUEST
is the PI’s most popular software, as measured by the hundred or so hits its
website receives each week. This is probably due to its maturity (the first
version was released more than ten years ago) and to the adoption of its basic
algorithm by the commercial publishers SPSS Inc. and StatSoft.

The most significant new development in this area is the CRUISE algorithm,
which extends the unbiasedness of QUEST in several directions.
First,  it  splits  each  node  of  a  tree  into  as  many  branches  as  the  number  of 
values  taken  by  the  response  variable.  When  the  dataset  is  very  large,  this 
has  the  advantage  of  producing  a  shorter,  and  hence  more  comprehensible, 
tree.  CHAID  (Kass  1980)  is  the  only  other  algorithm  with  non-binary  splits. 
But  CHAID  is  otherwise  quite  different,  because  the  number  of  splits  is  fixed 
at  ten  for  each  ordered  predictor  variable  and  the  tree  is  not  pruned. 

The  other  unique  feature  is  CRUISE’s  sensitivity  to  local  interactions 
between  pairs  of  variables.  All  other  classification  tree  algorithms  concentrate 
only  on  one  variable  at  a  time.  Two  papers  on  CRUISE  were  published  during 
the period of the grant. They are [3] and [7] in Section 3.1. A third paper [8]
gives  an  application  of  CRUISE  to  a  problem  in  construction  engineering. 

3  Publications  and  technical  reports 

3.1  Peer-reviewed  journals 

1. Asymptotic theory for Box-Cox transformations in linear models (with
K. Cho, I. Yeo, and R. A. Johnson). Statistics and Probability Letters,
2001, 51, 337-343.

2. Prediction interval estimation in transformed linear models (with K.
Cho, I. Yeo, and R. A. Johnson). Statistics and Probability Letters,
2001, 51, 345-350.

3. Classification trees with unbiased multiway splits (with H. Kim).
Journal of the American Statistical Association, 2001, 96, 589-604.

4. Regression trees with unbiased variable selection and interaction
detection. Statistica Sinica, 2002, 12, 361-386.

5. A framework for measuring differences in data characteristics (with V.
Ganti, J. Gehrke and R. Ramakrishnan). Journal of Computer and
System Sciences, 2002, 64, 542-578.

6. Nonparametric estimation of conditional quantiles using quantile
regression trees (with P. Chaudhuri). Bernoulli, 2002, 8, 561-576.

7.  Classification  trees  with  bivariate  linear  discriminant  node  models  (with 
H.  Kim).  Journal  of  Computational  and  Graphical  Statistics,  2003,  12, 
512-530. 

8. Decision tree approach to classify and quantify cumulative impact of
change orders on productivity (with M. J. Lee and A. S. Hanna).
Journal of Computing in Civil Engineering, 2004, 18, 132-144.

9. LOTUS: An algorithm for building accurate and comprehensible logistic
regression trees (with K.-Y. Chan). Journal of Computational and
Graphical Statistics, 2004, 13, 826-852.

10. Box-Cox transformations. Book chapter in Encyclopedia of Statistical
Sciences, 2nd edition, Wiley. In press.

11. Logistic regression tree analysis. Book chapter in Handbook of
Engineering Statistics, Springer. In press.

3.2  Manuscripts  submitted 

1. A visualizable and interpretable regression model with good prediction
power (with H. Kim, Y.-S. Shih, and P. Chaudhuri). Submitted to IIE
Transactions Special Issue on Data Mining.

2. Regression tree models for designed experiments. Submitted to
Proceedings of the Second Lehmann Symposium.


4 Scientific personnel and advanced degrees earned

Two  PhD  students  were  supported  as  research  assistants  at  various  times 
during  the  grant  period: 

1. Hyungjun Cho

2.  Qinghua  Song 

Hyungjun Cho received his PhD degree under the PI’s supervision in 2002
(Cho 2002). He is currently a postdoctoral fellow at the University of
Virginia. Qinghua Song is expected to complete his degree in the next twelve
months.


5  Software  developed 

The executable binaries (for Linux, Windows, and others) of the following
algorithms are being distributed free from the PI’s website
http://www.stat.wisc.edu/~loh/.

1.  Cruise  classification  tree 

2.  Guide  regression  tree 

3.  Lotus  logistic  regression  tree 

4.  Quest  classification  tree 